# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.dates as mdates
%matplotlib inline
# load in the dataset into a pandas dataframe, print statistics
tripdata = pd.read_csv('tripdata.csv')
# Convert start_time and end_time to datetime for analysis
# (the timestamps carry fractional seconds, so the format needs %f)
tripdata['start_time'] = pd.to_datetime(tripdata['start_time'], format='%Y-%m-%d %H:%M:%S.%f')
tripdata['end_time'] = pd.to_datetime(tripdata['end_time'], format='%Y-%m-%d %H:%M:%S.%f')
# high-level overview of data shape and composition
print(tripdata.shape)
print(tripdata.dtypes)
print(tripdata.head(10))
(183412, 16)
duration_sec int64
start_time datetime64[ns]
end_time datetime64[ns]
start_station_id float64
start_station_name object
start_station_latitude float64
start_station_longitude float64
end_station_id float64
end_station_name object
end_station_latitude float64
end_station_longitude float64
bike_id int64
user_type object
member_birth_year float64
member_gender object
bike_share_for_all_trip object
dtype: object
duration_sec start_time end_time \
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975
1 42521 2019-02-28 18:53:21.789 2019-03-01 06:42:03.056
2 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146
3 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842
4 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074
5 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760
6 1147 2019-02-28 23:55:35.104 2019-03-01 00:14:42.588
7 1615 2019-02-28 23:41:06.766 2019-03-01 00:08:02.756
8 1570 2019-02-28 23:41:48.790 2019-03-01 00:07:59.715
9 1049 2019-02-28 23:49:47.699 2019-03-01 00:07:17.025
start_station_id start_station_name \
0 21.0 Montgomery St BART Station (Market St at 2nd St)
1 23.0 The Embarcadero at Steuart St
2 86.0 Market St at Dolores St
3 375.0 Grove St at Masonic Ave
4 7.0 Frank H Ogawa Plaza
5 93.0 4th St at Mission Bay Blvd S
6 300.0 Palm St at Willow St
7 10.0 Washington St at Kearny St
8 10.0 Washington St at Kearny St
9 19.0 Post St at Kearny St
start_station_latitude start_station_longitude end_station_id \
0 37.789625 -122.400811 13.0
1 37.791464 -122.391034 81.0
2 37.769305 -122.426826 3.0
3 37.774836 -122.446546 70.0
4 37.804562 -122.271738 222.0
5 37.770407 -122.391198 323.0
6 37.317298 -121.884995 312.0
7 37.795393 -122.404770 127.0
8 37.795393 -122.404770 127.0
9 37.788975 -122.403452 121.0
end_station_name end_station_latitude \
0 Commercial St at Montgomery St 37.794231
1 Berry St at 4th St 37.775880
2 Powell St BART Station (Market St at 4th St) 37.786375
3 Central Ave at Fell St 37.773311
4 10th Ave at E 15th St 37.792714
5 Broadway at Kearny 37.798014
6 San Jose Diridon Station 37.329732
7 Valencia St at 21st St 37.756708
8 Valencia St at 21st St 37.756708
9 Mission Playground 37.759210
end_station_longitude bike_id user_type member_birth_year \
0 -122.402923 4902 Customer 1984.0
1 -122.393170 2535 Customer NaN
2 -122.404904 5905 Customer 1972.0
3 -122.444293 6638 Subscriber 1989.0
4 -122.248780 4898 Subscriber 1974.0
5 -122.405950 5200 Subscriber 1959.0
6 -121.901782 3803 Subscriber 1983.0
7 -122.421025 6329 Subscriber 1989.0
8 -122.421025 6548 Subscriber 1988.0
9 -122.421339 6488 Subscriber 1992.0
member_gender bike_share_for_all_trip
0 Male No
1 NaN No
2 Male No
3 Other No
4 Male Yes
5 Male No
6 Female No
7 Male No
8 Other No
9 Male No
# descriptive statistics for numeric variables
print(tripdata.describe())
duration_sec start_station_id start_station_latitude \
count 183412.000000 183215.000000 183412.000000
mean 726.078435 138.590427 37.771223
std 1794.389780 111.778864 0.099581
min 61.000000 3.000000 37.317298
25% 325.000000 47.000000 37.770083
50% 514.000000 104.000000 37.780760
75% 796.000000 239.000000 37.797280
max 85444.000000 398.000000 37.880222
start_station_longitude end_station_id end_station_latitude \
count 183412.000000 183215.000000 183412.000000
mean -122.352664 136.249123 37.771427
std 0.117097 111.515131 0.099490
min -122.453704 3.000000 37.317298
25% -122.412408 44.000000 37.770407
50% -122.398285 100.000000 37.781010
75% -122.286533 235.000000 37.797320
max -121.874119 398.000000 37.880222
end_station_longitude bike_id member_birth_year
count 183412.000000 183412.000000 175147.000000
mean -122.352250 4472.906375 1984.806437
std 0.116673 1664.383394 10.116689
min -122.453704 11.000000 1878.000000
25% -122.411726 3777.000000 1980.000000
50% -122.398279 4958.000000 1987.000000
75% -122.288045 5502.000000 1992.000000
max -121.874119 6645.000000 2001.000000
The dataframe records 183,412 trips across 16 columns (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip).
user_type takes two values: Customer and Subscriber.
member_gender stores the user's gender as Male, Female, or Other.
I'm interested in finding out how long the average trip takes. Also, when are most trips taken in terms of time of day, day of the week, or month of the year? Does any of this depend on whether a user is a subscriber or a customer?
I expect that duration_sec, start_time, end_time, and user_type will play a big role in this data exploration. There could also be a relationship with the three genders in this dataset. I expect that station latitude and the user's age play a big role as well.
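One way to approach the subscriber-versus-customer question is to group trips by start hour and user type. The sketch below is only an illustration: the `demo` rows and the `start_hour` column are made-up assumptions that mirror the column names of tripdata, not values from the dataset.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as tripdata
demo = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-28 08:05:00", "2019-02-28 08:40:00",
        "2019-02-28 17:15:00", "2019-02-28 13:30:00",
    ]),
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer"],
    "duration_sec": [500, 620, 700, 1800],
})

# Hour of day (0-23) at which each trip started
demo["start_hour"] = demo["start_time"].dt.hour

# Trips per start hour, split by user type (missing combinations become 0)
trips_by_hour = (
    demo.groupby(["start_hour", "user_type"])
        .size()
        .unstack(fill_value=0)
)

# Median duration per user type; the median is less sensitive to a long tail than the mean
median_duration = demo.groupby("user_type")["duration_sec"].median()

print(trips_by_hour)
print(median_duration)
```

The same two groupby calls should work on tripdata itself once start_time has been parsed as a datetime.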
I'll start by looking at the distribution of the main variable of interest: duration.
# Let's start with a standard-scaled plot
fig = px.histogram(
data_frame=tripdata,
# Set up the x-axis
x="duration_sec",
title='Trip Length Distribution',
# Set the number of bins
nbins=400)
# Show the plot
fig.show()
# There's a long tail in the distribution, so let's put it on a log scale instead
fig = px.histogram(
data_frame=tripdata,
# Set up the x-axis
x="duration_sec",
title='Trip Length Distribution (log_x=True)',
# Logarithmic axes with Plotly Express
log_x=True)
# Show the plot
fig.show()
Duration has a long-tailed distribution, with many trips at the low end and few at the high end. When plotted on a log scale, the duration distribution still looks right-skewed, with one peak between 350 and 390 seconds. Interestingly, there's a steep decline in frequency right after 2,000 seconds.
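To locate a peak like the one described above numerically rather than by eye, the histogram can be computed with bin edges that are evenly spaced in log space. This is a minimal sketch on synthetic data: the lognormal parameters and the choice of 40 bins are assumptions for illustration, not properties of tripdata.

```python
import numpy as np

# Synthetic long-tailed durations in seconds (assumed lognormal, for illustration only)
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=6.2, sigma=0.7, size=10_000)

# 41 edges (40 bins) spaced geometrically between the smallest and largest duration
log_edges = np.geomspace(durations.min(), durations.max(), 41)

# Count observations per log-spaced bin
counts, edges = np.histogram(durations, bins=log_edges)

# Report the modal bin by its geometric midpoint
peak = counts.argmax()
peak_center = np.sqrt(edges[peak] * edges[peak + 1])
print(f"modal bin centre: {peak_center:.0f} s")
```

Replacing `durations` with `tripdata['duration_sec'].to_numpy()` would apply the same binning to the real column.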
Next up, the first predictor variable of interest: start_time.
# Plotting start_time on a standard scale
fig = px.histogram(
data_frame=tripdata,
x="start_time",
nbins=100)
fig.show()
# Investigating further on an even smaller bin size
fig = px.histogram(
data_frame=tripdata,
x="start_time",
nbins=200)
fig.show()
The smaller bin size provides us with a lot more information in general to analyse.
# Generate day counts for February
tripdata['start_time_date'] = tripdata['start_time'].dt.to_period('D').astype(str)
# Investigating further by day
fig = px.histogram(
data_frame=tripdata,
x="start_time_date",
title='Trip Start Time (by Date)')
fig.show()
# Generate day counts for days of week
start_time_day = tripdata['start_time'].dt.dayofweek.astype(str)
start_time_day = start_time_day.replace({"0" : 'Monday', "1" : 'Tuesday', "2": 'Wednesday', "3" : 'Thursday', "4" : 'Friday', "5" : 'Saturday',
"6" : 'Sunday'})
# Inserting the values into the tripdata dataframe
tripdata['start_time_day'] = start_time_day
# Investigating further by day
fig = px.histogram(
data_frame=tripdata,
x="start_time_day",
title='Trip Start Time (by Day of Week)')
# Fixing the x-axis order
fig.update_xaxes(categoryorder='array', categoryarray= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
fig.show()
In the case of start_time, the small bin size proves very illuminating. There are very large spikes in frequency at specific times of day (e.g. 3PM-9PM), after which frequency quickly trails off until the next spike. These probably represent the busiest hours of the day.
If we look at the data on a daily basis, there's a spike during the first and third weeks. Finally, there are significantly fewer trips on weekends, with Thursday being the most popular day of all.
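The day-of-week counts can also be produced without the manual number-to-name mapping. The sketch below uses pandas' `dt.day_name()` and reindexes the counts into calendar order; the four `starts` timestamps are hypothetical and only serve to make the example self-contained.

```python
import pandas as pd

# Hypothetical start times (2019-02-25 was a Monday)
starts = pd.Series(pd.to_datetime([
    "2019-02-25 08:00:00",  # Monday
    "2019-02-28 17:30:00",  # Thursday
    "2019-02-28 18:00:00",  # Thursday
    "2019-03-02 11:00:00",  # Saturday
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count trips per weekday name, then put the days in calendar order;
# days with no trips are filled with 0 instead of being dropped
day_counts = (
    starts.dt.day_name()
          .value_counts()
          .reindex(day_order, fill_value=0)
)
print(day_counts)
```

Because the result is a plain Series in a fixed order, it can be passed to a bar chart without having to fix the axis order afterwards.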
I'll now move on to the other variables in the dataset: member_gender and user_type.
fig = px.histogram(
data_frame=tripdata,
x="member_gender")
# Fixing the x-axis order
fig.update_xaxes(categoryorder='array', categoryarray= ['Male', 'Female', 'Other'])
fig.show()
# Calculating the proportion of male, female, and other users
# (rows with a missing gender stay in the denominator).
print('Male:', (tripdata['member_gender'] == 'Male').sum() / len(tripdata))
print('Female:', (tripdata['member_gender'] == 'Female').sum() / len(tripdata))
print('Other:', (tripdata['member_gender'] == 'Other').sum() / len(tripdata))
Male: 0.7123361612108259
Female: 0.222689900333675
Other: 0.019911456175168474
fig = px.histogram(
data_frame=tripdata,
x="user_type")
# Fixing the x-axis order
fig.update_xaxes(categoryorder='array', categoryarray= ['Subscriber', 'Customer'])
fig.show()
# Calculating the proportion of types of users.
print('Subscriber:', (tripdata['user_type'] == 'Subscriber').sum() / len(tripdata))
print('Customer:', (tripdata['user_type'] == 'Customer').sum() / len(tripdata))
Subscriber: 0.8916755719364055
Customer: 0.10832442806359453
On this platform, about 71% of trips are taken by male users. Women account for about 22% of trips, and approximately 2% are taken by users who identify as other.
On top of that, about 89% of trips are taken by subscribers, while only about 11% are taken by customers.
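The three gender shares printed above add up to roughly 0.955, so about 4.5% of rows have no gender recorded. A short sketch of how `value_counts` can report shares with and without those missing rows follows; the `gender` series here is a hypothetical stand-in for `tripdata['member_gender']`.

```python
import numpy as np
import pandas as pd

# Hypothetical gender column with missing entries, mimicking member_gender
gender = pd.Series(["Male", "Male", "Female", np.nan,
                    "Other", "Male", np.nan, "Female"])

# Shares over all rows, with missing values kept as their own category
shares_all = gender.value_counts(normalize=True, dropna=False)

# Shares over rows that have a recorded gender only
shares_known = gender.value_counts(normalize=True)

print(shares_all)
print(shares_known)
```

Which denominator is appropriate depends on the question: the first matches the proportions printed above, while the second describes only users who reported a gender.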
The duration variable took on a large range of values, so I looked at the data using a log transform. Also, the days of the week and date columns had to be changed to strings so that I could generate the histograms.
During the weekend, the number of trips drops significantly, indicating that something is keeping users from riding on those days.
To start off with, I want to look at the pairwise correlations present between features in the data.
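# Pearson correlation is only defined for numeric columns, so select those first
cr = tripdata.select_dtypes(include='number').corr(method='pearson')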
fig = go.Figure(go.Heatmap(
x=cr.columns,
y=cr.columns,
z=cr.values.tolist(),
colorscale='rdylgn', zmin=-1, zmax=1))
fig.show()
There's a strong positive relationship between end_station_latitude and start_station_latitude, and likewise between end_station_longitude and start_station_longitude. On the other hand, there's a strong negative relationship between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.
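To see why start and end coordinates can correlate so strongly, here is a small sketch on synthetic data. It assumes, purely for illustration, that a trip ends close to where it starts; the `frame` values are generated, not taken from tripdata. It also shows that text columns are set aside before computing Pearson correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Assumed start latitudes scattered around 37.77
start_lat = rng.normal(37.77, 0.1, size=500)

frame = pd.DataFrame({
    "start_station_latitude": start_lat,
    # Assumption: the end station is near the start station (small added noise)
    "end_station_latitude": start_lat + rng.normal(0, 0.01, size=500),
    # A text column, which cannot take part in a Pearson correlation
    "user_type": rng.choice(["Subscriber", "Customer"], size=500),
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = frame.select_dtypes(include="number").corr(method="pearson")
print(corr)
```

Under that assumption the off-diagonal correlation comes out close to 1, the same pattern the heatmap shows for the real station coordinates.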
# Generating the box plots for each variable.
tripdata_samp = tripdata.sample(n=3000, replace = False)
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sns.color_palette()[0]
    sns.boxplot(x=x, y=y, color=default_color, showfliers = False)
plt.figure(figsize = [10, 10])
g = sns.PairGrid(data = tripdata_samp, y_vars = ['duration_sec', 'member_birth_year'], x_vars = ['member_gender', 'start_time_day'],
                 height = 4, aspect = 1.6)
g.map(boxgrid)
plt.show()
<Figure size 720x720 with 0 Axes>
There are no significant differences in these plots. It is possible that users are a bit older on Tuesdays and Fridays than usual, as the lower edge of the birth-year IQR sits lower on those days than on the other days of the week (a lower birth year means an older user).
plt.scatter(data = tripdata, x = 'member_birth_year', y = 'duration_sec');
plt.xlabel('Birth Year')
plt.ylabel('Duration')
Text(0, 0.5, 'Duration')
Birth year showed a surprisingly strong association with ride duration. The scatter plot suggests an approximately exponential relationship between the two. The box plots tell us that there aren't large differences across user gender or day of the week.
There was also an interesting relationship between start_time and end_time. In addition, as noted for the heatmap, there is a strong negative relationship between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.
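The link between start_time and end_time can be checked directly, since the end of a trip should be its start plus its duration. The sketch below rebuilds the first three rows of the head() output shown earlier and compares the elapsed time with duration_sec; the `elapsed_floor` column name is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of the head() output shown earlier
sample = pd.DataFrame({
    "duration_sec": [52185, 42521, 61854],
    "start_time": pd.to_datetime([
        "2019-02-28 17:32:10.145",
        "2019-02-28 18:53:21.789",
        "2019-02-28 12:13:13.218",
    ]),
    "end_time": pd.to_datetime([
        "2019-03-01 08:01:55.975",
        "2019-03-01 06:42:03.056",
        "2019-03-01 05:24:08.146",
    ]),
})

# Elapsed time between start and end, in seconds (with a fractional part)
elapsed = (sample["end_time"] - sample["start_time"]).dt.total_seconds()

# Whole seconds elapsed, for comparison with the integer duration_sec column
sample["elapsed_floor"] = np.floor(elapsed).astype(int)

print(sample[["duration_sec", "elapsed_floor"]])
```

For these three rows the whole-second elapsed time equals duration_sec. Whether that holds for every row of the file is not checked here.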
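# 2D histogram of start time vs. birth year, weighted by trip duration.
# Rows without a birth year are dropped and datetimes are converted to
# Matplotlib date numbers, since the histogram needs finite numeric input.
hist_data = tripdata.dropna(subset = ['member_birth_year'])
plt.hist2d(x = mdates.date2num(hist_data['start_time']), y = hist_data['member_birth_year'],
           weights = hist_data['duration_sec'], cmap = 'viridis_r')
plt.gca().xaxis_date()
plt.xlabel('Start Time')
plt.ylabel('Birth Year')
plt.colorbar(label = 'Duration')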
# select duration of approximately 1 hour
tripdata_hour = (tripdata['duration_sec'] >= 3550) & (tripdata['duration_sec'] <= 3650)
tripdata_one = tripdata.loc[tripdata_hour,:]
fig = plt.figure(figsize = [13,6])
ax = sns.pointplot(data = tripdata_one, x = 'start_time', y = 'duration_sec', hue = 'start_time_day',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.ylabel('duration_sec')
plt.xticks(rotation=30)
plt.locator_params(nbins=8)
plt.show();
I extended my investigation of start time against duration in this section by bringing in birth year and day of the week. The 2D histogram shows more duration weight among younger users (later birth years), but in the second plot it is hard to see any relationship.
Looking at the point plot, there doesn't appear to be a systematic interaction between these features, although they aren't completely independent of one another. Still, it's interesting to see how start time and duration relate to the day of the week.
In this data investigation, I expected duration_sec, start_time, end_time, and user_type to play a significant role, along with possible differences across the three genders, the user's age, and station latitude. As I predicted, duration has a long-tailed distribution, with a large number of trips at the low end and a small number at the high end. The duration distribution appears right-skewed when shown on a log scale, with one peak between 350 and 390 seconds. Surprisingly, there is a steep drop in frequency right after 2,000 seconds.
The first predictor variable I looked at is start_time. The small bin size is particularly useful here. At certain times of day (e.g. 3PM-9PM), there are big spikes in frequency, which quickly fade away until the next surge. These are most likely the busiest hours of the day. Looking at the data on a daily basis, there is a rise in the first and third weeks. Finally, weekend usage is significantly lower, with Thursday being the most popular day of the week.
Next, I looked at the member_gender and user_type variables. About 71% of trips are taken by men, about 22% by women, and about 2% by users who identify as other. Furthermore, about 89% of trips are taken by subscribers, whereas only about 11% are taken by customers.
Because the duration variable had such a wide range of values, I used a log transform to examine it. To construct the histograms, the day-of-week and date columns had to be converted to strings. The number of trips drops considerably over the weekend, indicating that something is keeping users from riding on those days.
Ride duration was, interestingly, associated with birth year; when duration was plotted against it, an approximately exponential relationship appeared. Box plots show few differences by user gender or day of the week. The station coordinates are strongly related as well: start and end latitude are positively related, as are start and end longitude, whereas there is a strong negative association between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.
Finally, there is a strong positive, linear relationship between start_time and end_time.